26 research outputs found

EVEREST: a collection of evolutionary conserved protein domains

Author: Linial Michal
Linial Nathan
Portugaly Elon
Publication venue: Oxford University Press
Publication date: 01/01/2006

Protein domains are subunits of proteins that recur throughout the protein world. There are many definitions attempting to capture the essence of a protein domain, and several systems that identify protein domains and classify them into families. EVEREST, recently described in Portugaly et al. (2006) BMC Bioinformatics, 7, 277, is one such system that performs the task automatically, using protein sequence alone. Herein we describe EVEREST release 2.0, consisting of 20 029 families, each defined by one or more HMMs. The current EVEREST database was constructed by scanning UniProt 8.1 and all PDB sequences (over 3 000 000 sequences in total) with each of the EVEREST families. EVEREST annotates 64% of all sequences, and covers 59% of all residues. EVEREST is available at . The website provides annotations given by SCOP, CATH, Pfam A and EVEREST. It allows browsing through the families of each of those sources and graphically visualizing the domain organization of the proteins in each family. The website also provides access to analyses of relationships between domain families, within and across domain definition systems. Users can upload sequences for analysis by the set of EVEREST families. Finally, an advanced search form allows querying for families that match criteria regarding novelty, phylogenetic composition and more.

CiteSeerX

Crossref

PubMed Central

Hidden Markov model speed heuristic and iterative HMM search procedure

Author: Eddy Sean R
Johnson L Steven
Portugaly Elon
Publication venue: Digital Commons@Becker
Publication date: 01/01/2010

BACKGROUND: Profile hidden Markov models (profile-HMMs) are sensitive tools for remote protein homology detection, but the main scoring algorithms, Viterbi or Forward, require considerable time to search large sequence databases. RESULTS: We have designed a series of database filtering steps, HMMERHEAD, that are applied prior to the scoring algorithms, as implemented in the HMMER package, in an effort to reduce search time. Using this heuristic, we obtain a 20-fold decrease in Forward and a 6-fold decrease in Viterbi search time with a minimal loss in sensitivity relative to the unfiltered approaches. We then implemented an iterative profile-HMM search method, JackHMMER, which employs the HMMERHEAD heuristic. Due to our search heuristic, we eliminated the subdatabase creation that is common in current iterative profile-HMM approaches. On our benchmark, JackHMMER detects 14% more remote protein homologs than SAM's iterative method T2K. CONCLUSIONS: Our search heuristic, HMMERHEAD, significantly reduces the time needed to score a profile-HMM against large sequence databases. This search heuristic allowed us to implement an iterative profile-HMM search method, JackHMMER, which detects significantly more remote protein homologs than SAM's T2K and NCBI's PSI-BLAST
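The iterative search idea in this abstract (search, rebuild the model from the hits, search again) can be illustrated with a toy loop. This is only a sketch under loud assumptions: a set of 2-mers stands in for the profile-HMM, the score is the fraction of a sequence's 2-mers found in that set, and the names `kmers`, `build_model`, `score` and `iterative_search` are hypothetical. It is not the JackHMMER or HMMERHEAD implementation.

```python
def kmers(seq, k=2):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_model(members, k=2):
    """Toy stand-in for a profile: the set of k-mers seen in the members."""
    return {w for s in members for w in kmers(s, k)}


def score(model, seq, k=2):
    """Fraction of the sequence's k-mers that occur in the model."""
    words = kmers(seq, k)
    if not words:
        return 0.0
    return sum(w in model for w in words) / len(words)


def iterative_search(seed, database, threshold=0.4, max_rounds=5):
    """Repeat search and model rebuilding until no new sequence is found."""
    members = {seed}
    for _ in range(max_rounds):
        model = build_model(members)
        hits = {s for s in database if score(model, s) >= threshold}
        if hits <= members:  # converged: nothing new this round
            break
        members |= hits
    return members


database = ["BBBBCCCC", "CCCCDDDD", "ZZZZYYYY"]
one_round = iterative_search("AAAABBBB", database, max_rounds=1)
converged = iterative_search("AAAABBBB", database)
```

In this toy setting a single round reaches only the sequence that shares 2-mers with the seed, while iterating also reaches a sequence that is similar only to the first hit. That is the general motivation for iterative searches; the real tools use statistical models and significance thresholds rather than raw k-mer overlap.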

Springer - Publisher Connector

Digital Commons@Becker

PubMed Central

EVEREST: automatic identification and classification of protein domains in all protein sequences

Author: A Bairoch
A Barak
A Bateman
A Heger
Amir Harel
B Boeckmann
CH Wu
E Portugaly
Elon Portugaly
F Servant
HM Berman
J Gracy
J Liu
J Park
J Schultz
JD Thompson
JM Chandonia
Michal Linial
N Kaplan
N Nagarajan
Nathan Linial
NJ Mulder
O Dekel
O Sasson
O Shachar
SF Altschul
SR Eddy
TF Smith
TJ Hubbard
Y Inbar
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Proteins are composed of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The best putative families are selected using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again. RESULTS: Processing the Swiss-Prot section of the UniProt Knowledgebase, release 7.2, EVEREST defines 20,230 domains, covering 85% of the amino acids of the Swiss-Prot database. EVEREST annotates 11,852 proteins (6% of the database) that are not annotated by Pfam A. In addition, in 43,086 proteins (20% of the database), EVEREST annotates a part of the protein that is not annotated by Pfam A. Performance tests show that EVEREST recovers 56% of Pfam A families and 63% of SCOP families with high accuracy, and suggests previously unknown domain families with at least 51% fidelity. EVEREST domains are often a combination of domains as defined by Pfam or SCOP and are frequently sub-domains of such domains. CONCLUSION: The EVEREST process and its output domain families provide an exhaustive and validated view of the protein domain world that is automatically generated from sequence data. The EVEREST library of domain families, accessible for browsing and download at [1], provides a complementary view to that provided by other existing libraries. Furthermore, since it is automatic, the EVEREST process is scalable and we will run it in the future on larger databases as well. The EVEREST source files are available for download from the EVEREST web site.

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Hidden Markov model speed heuristic and iterative HMM search procedure

Author: AA Schaffer
ED Scheeff
Elon Portugaly
GA Price
JM Chandonia
K Karplus
L Holm
L Lo Conte
L Steven Johnson
M Madera
RD Finn
SE Brenner
Sean R Eddy
SF Altschul
WN Grundy
WR Pearson
Publication venue: Springer Science and Business Media LLC

Crossref

HMMERHEAD - Accelerating HMM Searches On Large Databases

Author: Elon Portugaly
Matan Ninio

Introduction

HMMs have proven useful in protein sequence analysis [1]. However, a full search of a sequence database using an HMM is computationally expensive: running all the Pfam [3] HMMs on the SWISS-PROT database [4] takes almost three months of computer time. The two-hit method used by Altschul et al. [2] allows BLAST to accelerate both sequence vs. sequence searches and profile vs. sequence searches. In this work we build a framework that uses a similar method for HMM searches. We provide HMMER Hashing Enabled Acceleration Device (HMMERHEAD), a software package that filters out sequences for hmmsearch. Our experiments show that we typically achieve a 15-fold acceleration of running time, while retaining 99% of the results.

2 The Two-Hit Method

The two-hit method was introduced in [2]. A short description of the method follows. Preprocessing: In a preprocessing step, a database of k-mers (i.e. words of size k over the alphabet used) is compiled from the
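As an illustration of the general two-hit idea from BLAST, the sketch below indexes the k-mers of a query and accepts a target sequence only if two non-overlapping word hits fall on the same diagonal within a fixed window. It makes several simplifying assumptions: hits are exact k-mer matches (BLAST itself uses scored neighbourhood words), and the names `build_kmer_index` and `passes_two_hit` and the parameters `k=3` and `window=40` are hypothetical. It is not the HMMERHEAD filter, whose details are not given in the excerpt above.

```python
from collections import defaultdict


def build_kmer_index(query, k):
    """Map each k-mer of the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def passes_two_hit(query, target, k=3, window=40):
    """True if two non-overlapping exact k-mer hits share a diagonal
    and start at most `window` positions apart in the target."""
    index = build_kmer_index(query, k)
    last_hit = {}  # diagonal -> target start of the most recent kept hit
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diag = j - i
            prev = last_hit.get(diag)
            if prev is not None:
                gap = j - prev
                if gap < k:
                    continue  # overlaps the earlier hit; keep the earlier one
                if gap <= window:
                    return True  # second hit close enough on this diagonal
            last_hit[diag] = j  # first hit, or previous hit was too far back
    return False


query = "MKVLAAGIVGLWQERT"
two_hits = passes_two_hit(query, "PP" + "MKV" + "P" * 10 + "ERT")
one_hit = passes_two_hit(query, "PPMKVPPPP")
off_diagonal = passes_two_hit(query, "MKV" + "PPP" + "ERT")
```

Only targets that pass such a cheap test would be handed to the expensive scoring step, which is where the speed-up of a filter of this kind comes from.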

CiteSeerX